Text Corpus

In linguistics, a corpus (plural ''corpora'') or text corpus is a language resource consisting of a large and structured set of texts (nowadays usually electronically stored and processed). In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, checking occurrences, or validating linguistic rules within a specific language territory. In search technology, a corpus is the collection of documents being searched.


Overview

A corpus may contain texts in a single language (''monolingual corpus'') or text data in multiple languages (''multilingual corpus''). To make corpora more useful for linguistic research, they are often subjected to a process known as annotation. One example of annotating a corpus is part-of-speech tagging, or ''POS-tagging'', in which information about each word's part of speech (verb, noun, adjective, etc.) is added to the corpus in the form of ''tags''. Another example is indicating the lemma (base) form of each word. When the language of the corpus is not a working language of the researchers who use it, interlinear glossing is used to make the annotation bilingual.

Some corpora have further ''structured'' levels of analysis applied. In particular, smaller corpora may be fully parsed. Such corpora are usually called treebanks or parsed corpora. Because it is difficult to ensure that an entire corpus is completely and consistently annotated, these corpora are usually smaller, containing around one to three million words. Other levels of linguistic structured analysis are possible, including annotations for morphology, semantics and pragmatics.
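As a rough illustration of the annotation described above, the following Python sketch stores one hand-annotated sentence in which each token carries a part-of-speech tag and a lemma. The `Token` class, the tag names and the `lemmas_with_tag` helper are assumptions made for this example, not a standard corpus format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # the word as it appears in the text
    pos: str    # part-of-speech tag (illustrative tag names)
    lemma: str  # base (dictionary) form of the word


# One hand-annotated sentence; a real corpus would hold many of these.
sentence = [
    Token("The", "DET", "the"),
    Token("cats", "NOUN", "cat"),
    Token("were", "AUX", "be"),
    Token("sleeping", "VERB", "sleep"),
]


def lemmas_with_tag(tokens, tag):
    """Return the lemmas of all tokens carrying the given POS tag."""
    return [t.lemma for t in tokens if t.pos == tag]


print(lemmas_with_tag(sentence, "VERB"))
```

Keeping the tag and lemma alongside the surface form is what lets a researcher query the corpus by grammatical category rather than by spelling alone.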


Applications

Corpora are the main knowledge base in corpus linguistics. Other notable areas of application include:
* Language technology, natural language processing, computational linguistics
** The analysis and processing of various types of corpora are also the subject of much work in computational linguistics, speech recognition and machine translation, where they are often used to create hidden Markov models for part-of-speech tagging and other purposes. Corpora and frequency lists derived from them are useful for language teaching. Corpora can be considered a type of foreign language writing aid: the contextualised grammatical knowledge that non-native language users acquire through exposure to authentic texts in corpora allows learners to grasp the manner of sentence formation in the target language, enabling effective writing (Yoon, H., & Hirvela, A. (2004). ESL Student Attitudes toward Corpus Use in L2 Writing. ''Journal of Second Language Writing, 13''(4), 257–283. Retrieved 21 March 2012).
* Machine translation
** Multilingual corpora that have been specially formatted for side-by-side comparison are called ''aligned parallel corpora''. There are two main types of parallel corpora which contain texts in two languages. In a ''translation corpus'', the texts in one language are translations of texts in the other language. In a ''comparable corpus'', the texts are of the same kind and cover the same content, but they are not translations of each other. To exploit a parallel text, some kind of text alignment identifying equivalent text segments (phrases or sentences) is a prerequisite for analysis. Machine translation algorithms for translating between two languages are often trained using parallel fragments comprising a first-language corpus and a second-language corpus, which is an element-for-element translation of the first-language corpus.
* Philologies
** Text corpora are also used in the study of historical documents, for example in attempts to decipher ancient scripts, or in Biblical scholarship. Some archaeological corpora cover so short a period that they provide a snapshot in time. One of the shortest may be the Amarna letters (1350 BC), which span 15–30 years. The ''corpus'' of an ancient city (for example the "Kültepe Texts" of Turkey) may be divided into a series of corpora, determined by the dates of their find sites.
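The frequency lists mentioned above can be derived from a corpus by counting word forms. The Python sketch below is a minimal illustration under simplifying assumptions: the two-sentence `corpus` is made-up sample data, `frequency_list` is a hypothetical helper name, and the tokenisation (lowercasing and keeping runs of letters and apostrophes) is far cruder than what real corpus tools use.

```python
import re
from collections import Counter


def frequency_list(texts):
    """Count word forms across all texts, most frequent first."""
    counts = Counter()
    for text in texts:
        # Naive tokenisation: lowercase, keep runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common()


# Made-up sample data standing in for a real corpus.
corpus = [
    "The cat sat on the mat.",
    "The dog chased the cat.",
]

# Show the three most frequent word forms with their counts.
for word, count in frequency_list(corpus)[:3]:
    print(word, count)
```

On a corpus of realistic size, a list like this is the raw material for the vocabulary-teaching uses described above, and for checking distributional claims such as Zipf's law.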


Some notable text corpora


See also

* Concordance
* Corpus linguistics
* Distributional–relational database
* Linguistic Data Consortium
* Natural language processing
* Natural Language Toolkit
* Parallel text alignment
* Search engines: they access the "web corpus".
* Speech corpus
* Translation memory
* Treebank
* Zipf's Law


External links


* ACL SIGLEX Resource Links: Text Corpora
* Developing Linguistic Corpora: a Guide to Good Practice
* Free samples (not free), web-based corpora (45-425 million words each): American (COCA, COHA, TIME), British (BNC), Spanish, Portuguese
* Intercorp: Building synchronous parallel corpora of the languages taught at the Faculty of Arts of Charles University.
* Sketch Engine: Open corpora with free access
* TS Corpus – A Turkish Corpus freely available for academic research.
* Turkish National Corpus – A general-purpose corpus for contemporary Turkish
* Corpus of Political Speeches: Free access to political speeches by American and Chinese politicians, developed by Hong Kong Baptist University Library
* Russian National Corpus